Abstract

This  paper  describes  a  decentralized  consistency  protocol  for  survivable  storage  that  exploits  data  versioning  within 
storage-nodes.  Versioning  enables  the  protocol  to  efficiently  provide  linearizability  and  wait-freedom  of  read  and 
write  operations  to  erasure-coded  data  in  asynchronous  environments  with  Byzantine  failures  of  clients  and  servers. 
Exploiting  versioning  storage-nodes,  the  protocol  shifts  most  work  to  clients.  Reads  occur  in  a  single  round-trip 
unless  clients  observe  concurrency  or  write  failures.  Measurements  of  a  storage  system  using  this  protocol  show 
that  the  protocol  scales  well  with  the  number  of  failures  tolerated,  and  that  it  outperforms  a  highly-tuned  instance  of 
Byzantine-tolerant  state  machine  replication. 
1  Introduction


Survivable  storage  systems  spread  data  redundantly  across  a  set  of  decentralized  storage-nodes 
in  an  effort  to  ensure  its  availability  despite  the  failure  or  compromise  of  storage-nodes.  Such 
systems  require  some  protocol  to  maintain  data  consistency  and  liveness  in  the  presence  of  failures 
and  concurrency. 

This  paper  describes  and  evaluates  a  new  consistency  protocol  that  operates  in  an  asynchronous 
environment  and  tolerates  Byzantine  failures  of  clients  and  storage-nodes.  The  protocol  supports 
a hybrid failure model in which up to $t$ storage-nodes may fail: $b \le t$ of these failures can be
Byzantine, and the remainder can crash. The protocol requires at least $2t + 2b + 1$ storage-nodes (i.e.,
$4b + 1$ if $t = b$). The protocol supports m-of-n erasure codes (i.e., any m of the n fragments suffice to
reconstruct  the  data),  which  usually  require  less  network  bandwidth  (and  storage  space)  than  full 
replication  [40,  41]. 

Briefly,  the  protocol  works  as  follows.  To  perform  a  write,  a  client  determines  the  current 
logical  time  and  then  writes  time-stamped  fragments  to  at  least  a  threshold  quorum  of  storage- 
nodes.  Storage-nodes  keep  all  versions  of  fragments  they  are  sent  until  garbage  collection  frees 
them.  To  perform  a  read,  a  client  fetches  the  latest  fragment  versions  from  a  threshold  quorum 
of  storage-nodes  and  determines  whether  they  comprise  a  completed  write;  usually,  they  do.  If 
they  do  not,  additional  and  historical  fragments  are  fetched,  and  repair  may  be  performed,  until  a 
completed  write  is  observed. 

The  protocol  gains  efficiency  from  three  features.  First,  the  space-efficiency  of  m-of-n  erasure 
codes  can  be  substantial.  Second,  most  read  operations  complete  in  a  single  round  trip:  reads 
that observe write concurrency or failures (of storage-nodes or a client write) may incur additional
work. Most studies of distributed storage systems (e.g., [3, 26]) indicate that concurrency
is uncommon (i.e., writer-writer and writer-reader sharing occurs in well under 1% of operations).
Failures,  although  tolerated,  ought  to  be  rare.  Moreover,  a  subsequent  write  or  read  (with  repair) 
will  replace  a  write  that  has  not  been  completed,  thus  preventing  future  reads  from  incurring  any 
additional  cost;  when  writes  do  the  fixing,  additional  costs  are  never  incurred.  Third,  most  protocol 
processing  is  performed  by  clients,  increasing  scalability  via  the  well-known  principle  of  shifting 
work  from  servers  to  clients  [17]. 

This paper describes the protocol in detail and provides a proof sketch of its safety and liveness.
It also describes and evaluates a prototype storage system, called PASIS [41], that uses the
protocol.  Measurements  of  the  prototype  highlight  the  benefits  of  the  three  features.  For  example, 
client  response  times  are  only  20%  higher  when  tolerating  four  Byzantine  storage-node  failures 
than when tolerating one. For comparison, a well-performing state machine replication implementation
[5] has a higher single-fault response time for write operations and is 2x slower when
tolerating  four  faults.  Additionally,  the  shifting  of  protocol  processing  to  clients  halves  storage- 
node  CPU  load  and  thereby  doubles  throughput  as  the  number  of  clients  scales  upwards. 


2  Background  and  related  work 

Figure  1  illustrates  the  abstract  architecture  of  a  fault-tolerant,  or  survivable,  distributed  storage 
system.  To  write  a  data-item  D,  Client  A  issues  write  requests  to  multiple  storage-nodes.  To  read 
D, Client B issues read requests to an overlapping subset of storage-nodes. This scheme provides
access to data-items even when subsets of the storage-nodes have failed.

Figure 1: High-level architecture for survivable storage. Spreading data redundantly across
storage-nodes improves its fault-tolerance. Clients write and (usually) read data from multiple
storage-nodes.

A common data distribution scheme used in such systems is replication, in which a writer
stores a replica of the new data-item value at each storage-node to which it sends a write request.
Alternatively, more space-efficient erasure coding schemes can be used to reduce network load and
storage  consumption. 

To  provide  reasonable  semantics,  the  system  must  guarantee  that  readers  see  consistent  data- 
item  values.  Specifically,  the  linearizability  of  operations  is  desirable  for  a  shared  storage  system. 
Our protocol tolerates Byzantine faults of any number of clients and a limited number of storage-nodes
while implementing linearizable [16] and wait-free [14] read-write objects. Linearizability
is  adapted  appropriately  for  Byzantine  clients,  and  wait-freedom  is  as  in  the  model  of  [18]. 

Most  prior  systems  implementing  Byzantine  fault-tolerant  services  adopt  the  state  machine 
approach  [35],  whereby  all  operations  are  processed  by  server  replicas  in  the  same  order  (atomic 
broadcast).  While  this  approach  supports  a  linearizable,  Byzantine  fault-tolerant  implementation 
of  any  deterministic  object,  such  an  approach  cannot  be  wait-free  [12,  14,  18].  Instead,  such 
systems  achieve  liveness  only  under  stronger  timing  assumptions,  such  as  synchrony  (e.g.,  [8,  29, 
37])  or  partial  synchrony  [11]  (e.g.,  [6,  19,  33]),  or  probabilistically  (e.g.,  [4]).  An  alternative 
is  Byzantine  quorum  systems  [22],  from  which  our  protocols  inherit  techniques.  Protocols  for 
supporting  a  linearizable  implementation  of  any  deterministic  object  using  Byzantine  quorums 
have  been  developed  (e.g.,  [23]),  but  also  necessarily  forsake  wait-freedom  to  do  so. 

Byzantine fault-tolerant protocols for implementing read-write objects using quorums are described
in [15, 22, 24, 28]. Of these, only Martin et al. [24] achieve linearizability in our fault
model,  and  this  work  is  also  closest  to  ours  in  that  it  uses  a  type  of  versioning.  In  our  protocol,  a 
reader  may  retrieve  fragments  for  several  versions  of  the  data-item  in  the  course  of  identifying  the 
return  value  of  a  read.  Similarly,  readers  in  [24]  “listen”  for  updates  (versions)  from  storage-nodes 
until a complete write is observed. Conceptually, our approach differs by clients reading past versions,
versus listening for future versions broadcast by servers. In our fault model, especially in
consideration  of  faulty  clients,  our  protocol  has  several  advantages.  First,  our  protocol  works  for 
erasure-coded data, whereas extending [24] to erasure-coded data appears nontrivial. Second, ours
provides better message efficiency: [24] involves a $\Theta(N^2)$ message exchange among the $N$ servers
per  write  (versus  no  server-to-server  exchange  in  our  case)  over  and  above  otherwise  comparable 
(and  linear  in  N)  message  costs.  Third,  ours  requires  less  computation,  in  that  [24]  requires  digital 
signatures by clients, which in practice is two orders of magnitude more costly than the cryptographic
transforms we employ. Advantages of [24] are that it tolerates a higher fraction of faulty
servers  than  our  protocol,  and  does  not  require  servers  to  store  a  potentially  unbounded  number  of 
data-item  versions.  Our  prior  analysis  of  versioning  storage,  however,  suggests  that  the  latter  is  a 
non-issue  in  practice  [39],  and  even  under  attack  this  can  be  managed  using  a  garbage  collection 
mechanism  we  describe  in  Section  6. 

We  contrast  our  use  of  versioning  to  maintain  consistency  with  systems  in  which  each  write 
creates  a  new,  immutable  version  of  a  data-item  to  which  subsequent  reads  are  directed  (e.g.,  [25, 
31, 32]). Such systems merely shift the consistency and liveness problems to the metadata mechanism
that resolves data-item names to a version. So, systems that employ such an approach (e.g.,
Past [34], CFS [9], Farsite [2], and the archival portion of OceanStore [21]) require a separate
protocol  mechanism  to  manage  this  metadata. 

A  goal  of  our  work  is  to  demonstrate  that  our  protocol  is  sufficiently  efficient  to  use  in  practice. 
Castro  and  Liskov  [6]  have  made  available  for  experimentation  a  well-engineered  implementation 
of a Byzantine fault-tolerant replicated state machine. Their BFT library is used elsewhere in
the  research  community  (e.g.,  Farsite  [2]).  As  such,  we  use  their  implementation  as  a  point  of 
comparison  to  demonstrate  that  our  protocol  is  efficient  and  scalable  in  both  network  and  storage- 
node  CPU  utilization. 


3  System  model 

We  describe  the  system  infrastructure  in  terms  of  clients  and  storage-nodes.  There  are  N  storage- 
nodes  and  an  arbitrary  number  of  clients  in  the  system. 

A  client  or  storage-node  is  correct  in  an  execution  if  it  satisfies  its  specification  throughout  the 
execution. A client or storage-node that deviates from its specification fails. We assume a hybrid
failure model for storage-nodes. Up to $t$ storage-nodes may fail, $b \le t$ of which may be Byzantine
faults;  the  remainder  are  assumed  to  crash.  We  assume  that  Byzantine  storage-nodes  can  collude 
with  each  other  and  with  any  Byzantine  clients.  A  client  or  storage-node  that  does  not  exhibit  a 
Byzantine  failure  (it  is  either  correct  or  crashes)  is  benign. 

The  protocol  tolerates  crash  and  Byzantine  clients.  As  in  any  practical  storage  system,  an 
authorized  Byzantine  client  can  write  arbitrary  values  to  storage,  which  affects  the  value  of  the  data, 
but  not  its  consistency.  We  assume  that  Byzantine  clients  and  storage-nodes  are  computationally 
bounded  so  that  we  can  employ  cryptographic  primitives. 

We  assume  an  asynchronous  model  of  time  (i.e.,  we  make  no  assumptions  about  message 
transmission delays or the execution rates of clients and storage-nodes). We assume that communication
between a client and a storage-node is point-to-point, reliable, and authenticated: a
correct  storage-node  (client)  receives  a  message  from  a  correct  client  (storage-node)  if  and  only  if 
that  client  (storage-node)  sent  it  to  it. 
There are two types of operations in the protocol, read operations and write operations,
both of which operate on data-items. Clients perform read/write operations that issue multiple
read/write  requests  to  storage-nodes.  A  read/write  request  operates  on  a  data-fragment.  A  data- 
item is encoded into data-fragments. Clients may encode data-items in an erasure-tolerant manner;
thus  the  distinction  between  data-item  and  data-fragment.  Requests  are  executed  by  storage-nodes; 
a  correct  storage-node  that  executes  a  write  request  hosts  that  write  operation. 

Storage-nodes  provide  fine-grained  versioning,  meaning  that  a  correct  storage-node  hosts  a 
version of the data-fragment for each write request it executes. There is a well-known zero time,
0, and null value, ⊥, which storage-nodes can return in response to read requests. Implicitly, all
stored data is initialized to ⊥ at time 0.


4  Protocol 

This  section  describes  our  Byzantine  fault-tolerant  consistency  protocol  that  efficiently  supports 
erasure-coded data-items by taking advantage of versioning storage-nodes. It presents the mechanisms
employed to encode and decode data, and to protect data integrity from Byzantine storage-
nodes  and  clients.  It  then  describes,  in  detail,  the  protocol  in  pseudo-code  form.  Finally,  it  develops 
constraints  on  protocol  parameters  to  ensure  the  safety  and  liveness  of  the  protocol. 

4.1  Overview 

At  a  high  level,  the  protocol  proceeds  as  follows.  Logical  timestamps  are  used  to  totally  order 
all  write  operations  and  to  identify  data-fragments  pertaining  to  the  same  write  operation  across 
the  set  of  storage-nodes.  For  each  write,  a  logical  timestamp  is  constructed  by  the  client  that  is 
guaranteed  to  be  unique  and  greater  than  that  of  the  latest  complete  write  (the  complete  write  with 
the  highest  timestamp).  This  is  accomplished  by  querying  storage-nodes  for  the  greatest  timestamp 
they  host,  and  then  incrementing  the  greatest  response.  In  order  to  verify  the  integrity  of  the  data, 
a  hash  that  can  verify  data-fragment  correctness  is  appended  to  the  logical  timestamp. 

To  perform  a  read  operation,  clients  issue  read  requests  to  a  subset  of  storage-nodes.  Once  at 
least  a  read  quorum  of  storage-nodes  reply,  the  client  identifies  the  candidate — the  response  with 
the  greatest  logical  timestamp.  The  set  of  read  responses  that  share  the  timestamp  of  the  candidate 
comprise  the  candidate  set.  The  read  operation  classifies  the  candidate  as  complete,  repairable, 
or incomplete. If the candidate is classified as complete, the data-fragments, timestamp, and return
value are validated. If validation is successful, the value of the candidate is returned and the
read operation is complete; otherwise, the candidate is reclassified as incomplete. If the candidate
is classified as repairable, it is repaired by writing data-fragments back to the original set of
storage-nodes. Prior to performing repair, data-fragments are validated in the same manner as for
a complete candidate. If the candidate is classified as incomplete, the candidate is discarded, previous
data-fragment versions are requested, and classification begins anew. All candidates fall into
one  of  the  three  classifications,  even  those  corresponding  to  concurrent  or  failed  write  operations. 
4.2  Mechanisms

Several  mechanisms  are  used  in  our  protocol  to  achieve  linearizability  in  the  presence  of  Byzantine 
faults. 

4.2.1  Erasure  codes 

In  an  erasure  coding  scheme,  N  data-fragments  are  generated  during  a  write  (one  for  each  storage- 
node),  and  any  m  of  those  data-fragments  can  be  used  to  decode  the  original  data-item.  Any 
m of the data-fragments can deterministically generate the other $N - m$ data-fragments. We use a
systematic  information  dispersal  algorithm  [30],  which  stripes  the  data-item  across  the  first  m  data- 
fragments  and  generates  erasure-coded  data-fragments  for  the  remainder.  Other  threshold  erasure 
codes (e.g., Secret Sharing [36] and Short Secret Sharing [20]) work as well.
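To make the m-of-N property concrete, the following is a minimal sketch of a systematic threshold erasure code. It is not the information dispersal algorithm of [30]: it assumes polynomial interpolation over the prime field GF(257), and the names `encode`, `decode`, and `_interpolate` are hypothetical. The first m fragments hold the striped data-item, and the remaining fragments are extra evaluations of the same polynomial.

```python
P = 257  # assumed field size: smallest prime above 255, so every byte is a field element


def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, m: int, n: int):
    """Return n fragments; fragment k stores, for each stripe, the symbol at x = k + 1."""
    if len(data) % m:
        raise ValueError("pad the data-item to a multiple of m first")
    fragments = [[] for _ in range(n)]
    for start in range(0, len(data), m):
        # The m data bytes of a stripe are the polynomial's values at x = 1..m.
        stripe = [(x + 1, data[start + x]) for x in range(m)]
        for k in range(n):
            fragments[k].append(_interpolate(stripe, k + 1))
    return fragments


def decode(available: dict, m: int) -> bytes:
    """Rebuild the data-item from any m fragments, given as {fragment_index: symbols}."""
    chosen = sorted(available.items())[:m]
    if len(chosen) < m:
        raise ValueError("need at least m fragments")
    out = bytearray()
    for pos in range(len(chosen[0][1])):
        points = [(k + 1, symbols[pos]) for k, symbols in chosen]
        out.extend(_interpolate(points, x + 1) for x in range(m))
    return bytes(out)
```

Because coded symbols can take the value 256, this sketch keeps fragments as lists of integers rather than bytes; a production code would typically work over GF(2^8) instead.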

4.2.2  Data-fragment  integrity 

Byzantine  storage-nodes  can  corrupt  their  data-fragments.  As  such,  it  must  be  possible  to  detect 
and  mask  up  to  b  storage-node  integrity  faults. 

Cross checksums: Cross checksums [13] are used to detect corrupt data-fragments. A cryptographic
hash of each data-fragment is computed. The set of N hashes is concatenated to form the
cross  checksum  of  the  data-item.  The  cross  checksum  is  stored  with  each  data-fragment  (i.e.,  it  is 
replicated  N  times).  Cross  checksums  enable  read  operations  to  detect  data-fragments  that  have 
been  modified  by  storage-nodes. 
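The construction can be sketched as follows. The choice of SHA-256 and the helper names `make_cross_checksum` and `fragment_matches` are assumptions made for illustration; the text above does not fix a particular hash function.

```python
import hashlib

HASH_LEN = hashlib.sha256().digest_size  # 32 bytes per data-fragment hash


def make_cross_checksum(fragments):
    """Concatenate the hash of every data-fragment, in storage-node order."""
    return b"".join(hashlib.sha256(f).digest() for f in fragments)


def fragment_matches(fragment, cross_checksum, index):
    """Check one data-fragment against its slot in the cross checksum."""
    expected = cross_checksum[index * HASH_LEN:(index + 1) * HASH_LEN]
    return hashlib.sha256(fragment).digest() == expected
```

A reader holding the cross checksum can thus reject a fragment that a storage-node modified without needing any other fragment.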

4.2.3  Write  operation  integrity 

Mechanisms  are  required  to  prevent  Byzantine  clients  from  performing  a  write  operation  that  lacks 
integrity.  If  a  Byzantine  client  generates  random  data-fragments  (rather  than  erasure  coding  a 
data-item correctly), then each of the $\binom{N}{m}$ subsets of $m$ data-fragments could “recover” a distinct
data-item. Additionally, a Byzantine client could partition the set of N data-fragments into
subsets  that  each  decode  to  a  distinct  data-item.  These  attacks  are  similar  to  poisonous  writes  for 
replication  as  described  by  Martin  et  al.  [24].  To  protect  against  Byzantine  clients,  the  protocol 
must ensure that read operations only return values that are written correctly (i.e., that are single-valued).
To achieve this, the cross checksum mechanism is extended in three ways: validating
timestamps,  storage-node  verification,  and  validated  cross  checksums. 

Validating  timestamps:  To  ensure  that  only  a  single  set  of  data-fragments  can  be  written  at 
any  logical  time,  the  hash  of  the  cross  checksum  is  placed  in  the  low  order  bits  of  the  logical 
timestamp.  Note,  the  hash  is  used  for  space-efficiency;  instead,  the  entire  cross  checksum  could  be 
placed  in  the  low  bits  of  the  timestamp. 

Storage-node  verification:  On  a  write,  each  storage-node  verifies  its  data-fragment  against 
its  hash  in  the  cross  checksum.  The  storage-node  also  verifies  the  cross  checksum  against  the 
hash  in  the  timestamp.  A  correct  storage-node  only  executes  write  requests  for  which  both  checks 
pass. Thus, a Byzantine client cannot make a correct storage-node appear Byzantine. It follows
that only Byzantine storage-nodes can return data-fragments that do not verify against the cross
checksum. 
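The two checks a storage-node performs before executing a write request can be sketched as below. The `Timestamp` tuple layout, the use of SHA-256, and the function name `validate` are illustrative assumptions.

```python
import hashlib
from typing import NamedTuple


def h(data: bytes) -> bytes:
    """Assumed hash function (SHA-256)."""
    return hashlib.sha256(data).digest()


class Timestamp(NamedTuple):
    time: int        # high-order bits: logical time
    client_id: int   # ID of the writing client
    verifier: bytes  # low-order bits: hash of the cross checksum


def validate(fragment, cross_checksum, timestamp, node_index):
    """Return True only if both storage-node checks pass."""
    # Check 1: the cross checksum must match the verifier in the timestamp.
    if h(cross_checksum) != timestamp.verifier:
        return False
    # Check 2: the data-fragment must match this node's hash in the cross checksum.
    size = hashlib.sha256().digest_size
    slot = cross_checksum[node_index * size:(node_index + 1) * size]
    return h(fragment) == slot
```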
Validated cross checksums: Storage-node verification combined with a validating timestamp
ensures that the data-fragments being considered by a read operation were written by the
client  (as  opposed  to  being  fabricated  by  Byzantine  storage-nodes).  To  ensure  that  the  client  that 
performed  the  write  operation  acted  correctly,  the  cross  checksum  must  be  validated  by  the  reader. 
To  validate  the  cross  checksum,  all  N  data-fragments  are  required.  Given  any  m  data-fragments, 
the  full  set  of  N  data-fragments  a  correct  client  should  have  written  can  be  generated.  The  “correct” 
cross  checksum  can  then  be  computed  from  the  regenerated  set  of  data-fragments.  If  the  generated 
cross  checksum  does  not  match  the  verified  cross  checksum,  then  a  Byzantine  client  performed  the 
write  operation.  Only  a  single-valued  write  operation  can  generate  a  cross  checksum  that  verifies 
against  the  validating  timestamp. 
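As a small illustration of this check, the sketch below assumes a toy 2-of-3 XOR parity code (third fragment = XOR of the first two) in place of the protocol's erasure code, and SHA-256 as the hash. All function names are hypothetical. A reader regenerates all three fragments from any two and compares the recomputed cross checksum with the one stored alongside the fragments.

```python
import hashlib


def cross_checksum(fragments):
    """Concatenate the SHA-256 hash of each fragment."""
    return b"".join(hashlib.sha256(f).digest() for f in fragments)


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def regenerate(known):
    """Given any two fragments as {index: fragment}, rebuild all three."""
    (i, fi), (j, fj) = sorted(known.items())[:2]
    missing = 3 - i - j  # indices are 0, 1, 2, so the third one is determined
    full = {i: fi, j: fj, missing: xor(fi, fj)}
    return [full[0], full[1], full[2]]


def write_is_single_valued(known, claimed_cc):
    """True if the regenerated fragments reproduce the stored cross checksum."""
    return cross_checksum(regenerate(known)) == claimed_cc
```

With a correctly encoded write, every pair of fragments passes the check; if a client writes an inconsistent third fragment, every pair fails it, so all readers reach the same verdict regardless of which storage-nodes responded.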

4.3  Pseudo-code 

The pseudo-code for the protocol is shown in Figures 2 and 3. The symbol $Q_W$ defines a complete
write threshold: the number of write responses a client must observe to know that the write operation
is complete. Thus, a write is complete once $Q_W - b$ benign storage-nodes have executed write
requests with timestamp ts. Note that $Q_W$ is a threshold quorum; as such, it represents a scalar
value, not a set. The symbol $LT$ denotes logical time and $LT_{candidate}$ denotes the logical time of the
candidate. The set $\{D_1, \ldots, D_N\}$ denotes the $N$ data-fragments; likewise, $\{S_1, \ldots, S_N\}$ denotes the
set of $N$ storage-nodes. In the pseudo-code, the operator ‘|’ denotes string concatenation.

4.3.1  Storage-node  interface 

Storage-nodes  offer  interfaces  to  write  a  data-fragment  at  a  specific  logical  time;  to  query  the 
greatest  logical  time  of  a  hosted  data-fragment;  to  read  the  hosted  data-fragment  with  the  greatest 
logical  time;  and  to  read  the  hosted  data-fragment  with  the  greatest  logical  time  at  or  before  some 
logical  time.  Given  the  simplicity  of  the  storage-node  interface,  storage-node  pseudo-code  has 
been  omitted. 
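Although the storage-node pseudo-code is omitted, the four interface calls can be sketched in a few lines. The class below is an in-memory illustration only: the class and method names, the tuple encoding of timestamps, and the use of `None` for the null value are assumptions, and the write-time verification of Section 4.2.3 is left out for brevity.

```python
import bisect

ZERO_TIME = (0, 0, b"")  # assumed encoding of the well-known zero time
NULL_VALUE = None        # stands in for the null value


class VersioningStorageNode:
    """Keeps every version of the hosted data-fragment, ordered by timestamp."""

    def __init__(self):
        self._times = []     # sorted logical timestamps
        self._versions = {}  # timestamp -> (fragment, cross_checksum)

    def write(self, lt, fragment, cross_checksum):
        """Host a new version at logical time lt; re-writes of lt are ignored."""
        if lt not in self._versions:
            bisect.insort(self._times, lt)
            self._versions[lt] = (fragment, cross_checksum)

    def greatest_time(self):
        """Greatest logical time of any hosted data-fragment."""
        return self._times[-1] if self._times else ZERO_TIME

    def read_latest(self):
        """Hosted data-fragment with the greatest logical time."""
        return self.read_at_or_before(self.greatest_time())

    def read_at_or_before(self, lt):
        """Hosted data-fragment with the greatest logical time at or before lt."""
        pos = bisect.bisect_right(self._times, lt)
        if pos == 0:
            return (ZERO_TIME, NULL_VALUE, NULL_VALUE)
        hosted = self._times[pos - 1]
        fragment, cross_checksum = self._versions[hosted]
        return (hosted, fragment, cross_checksum)
```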

4.3.2  Write  operation 

The  WRITE  operation  consists  of  determining  the  greatest  logical  timestamp,  constructing  write 
requests,  and  issuing  the  requests  to  the  storage-nodes.  First,  a  timestamp  greater  than,  or  equal  to, 
that of the latest complete write must be determined. Collecting more than $N + b - Q_W$ responses,
on line 8 of WRITE, is necessary to guarantee that the response set overlaps a complete write at
a minimum of $b + 1$ storage-nodes. Thus, the constraint ensures that the timestamp of the latest
complete  write  is  observed. 

Next,  the  ENCODE  function,  on  line  10,  encodes  the  data-item  into  N  data-fragments.  The 
data-fragments  are  used  to  construct  a  cross  checksum  from  the  concatenation  of  the  hash  of  each 
data-fragment (line 11). The function MAKE_TIMESTAMP, called on line 12, generates a logical
timestamp  to  be  used  for  the  current  write  operation.  This  is  done  by  incrementing  the  high  order 
bits  of  the  greatest  observed  logical  timestamp  from  the  ResponseSet  (i.e.,  LT.TIME),  appending 
the  client’s  ID,  and  appending  the  Verifier.  The  Verifier  is  just  the  hash  of  the  cross  checksum. 
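A minimal sketch of this construction follows. It assumes SHA-256 for the Verifier and models the timestamp as a named tuple, so that comparison is lexicographic: logical time first, then client ID, then Verifier. The names `LogicalTimestamp` and `make_timestamp` are hypothetical.

```python
import hashlib
from typing import NamedTuple


class LogicalTimestamp(NamedTuple):
    time: int        # high-order bits
    client_id: int   # distinguishes concurrent writers
    verifier: bytes  # hash of the cross checksum


def make_timestamp(observed, client_id, cross_checksum):
    """Increment the greatest observed time, then append the client ID and Verifier."""
    lt_max = max(observed)
    verifier = hashlib.sha256(cross_checksum).digest()
    return LogicalTimestamp(lt_max.time + 1, client_id, verifier)
```

Two clients that observe the same greatest timestamp produce timestamps with equal time fields, but the client ID keeps them distinct and totally ordered.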

Finally, write requests are issued to all storage-nodes. Each storage-node is sent a specific
data-fragment, the logical timestamp, and the cross checksum. A storage-node validates the cross
checksum with the verifier and validates the data-fragment with the cross checksum before executing
a write request (i.e., storage-nodes call VALIDATE listed in the read operation pseudo-code).
The write operation returns to the issuing client once at least $Q_W$ WRITE_RESPONSE messages are
received (line 7 of DO_WRITE).

WRITE(Data) :
 1: /* Determine greatest existing logical timestamp */
 2: for all S_i ∈ {S_1, ..., S_N} do
 3:     SEND(S_i, TIME_REQUEST)
 4: end for
 5: ResponseSet := ∅
 6: repeat
 7:     ResponseSet := ResponseSet ∪ RECEIVE(S, TIME_RESPONSE)
 8: until (|ResponseSet| > N + b − Q_W)
 9: /* Generate data-fragments, cross checksum and logical timestamp */
10: {D_1, ..., D_N} := ENCODE(Data)
11: CC := MAKE_CROSS_CHECKSUM({D_1, ..., D_N})
12: LT := MAKE_TIMESTAMP(MAX[ResponseSet.LT], CC)
13: /* Write out the data-fragments */
14: DO_WRITE({D_1, ..., D_N}, LT, CC)

MAKE_CROSS_CHECKSUM({D_1, ..., D_N}) :
 1: for all D_i ∈ {D_1, ..., D_N} do
 2:     H_i := HASH(D_i)
 3: end for
 4: CC := H_1 | ... | H_N
 5: RETURN(CC)

MAKE_TIMESTAMP(LT_max, CC) :
 1: LT.TIME := LT_max.TIME + 1
 2: LT.ID := ID
 3: LT.Verifier := HASH(CC)
 4: RETURN(LT)

DO_WRITE({D_1, ..., D_N}, LT, CC) :
 1: for all S_i ∈ {S_1, ..., S_N} do
 2:     SEND(S_i, WRITE_REQUEST, LT, D_i, CC)
 3: end for
 4: ResponseSet := ∅
 5: repeat
 6:     ResponseSet := ResponseSet ∪ RECEIVE(S, WRITE_RESPONSE)
 7: until (|ResponseSet| = Q_W)

Figure 2: Write operation pseudo-code.

4.3.3  Read  operation 

The read operation iteratively identifies and classifies candidates, until a repairable or complete
candidate is found. Once a repairable or complete candidate is found, the read operation validates
its correctness and returns the data. A client must observe at least $Q_W$ data-fragments with
matching logical timestamps in order to classify a candidate as complete. To ensure that a candidate
that corresponds to a complete write is classified as such as often as possible, $N - t$ data-fragments
are considered for classification (line 10 of DO_READ). Note that the read operation returns
a ⟨timestamp, value⟩ pair; in practice, a client only makes use of the value returned.

The read operation begins by issuing READ_LATEST_REQUEST commands to all storage-nodes
(via the DO_READ function). Each storage-node responds with the data-fragment, logical timestamp,
and cross checksum corresponding to the greatest timestamp it has executed.

READ() :
 1: ResponseSet := DO_READ(READ_LATEST_REQUEST, ⊥)
 2: loop
 3:     ⟨CandidateSet, LT_candidate⟩ := CHOOSE_CANDIDATE(ResponseSet)
 4:     if (|CandidateSet| ≥ Q_W − t − b) then
 5:         /* Complete or repairable write found */
 6:         {D_1, ..., D_N} := GENERATE_FRAGMENTS(CandidateSet)
 7:         CC_valid := MAKE_CROSS_CHECKSUM({D_1, ..., D_N})
 8:         if (CC_valid = CandidateSet.CC) then
 9:             /* Validated cross checksums match */
10:             if (|CandidateSet| < Q_W) then
11:                 /* Repair is necessary */
12:                 DO_WRITE({D_1, ..., D_N}, LT_candidate, CC_valid)
13:             end if
14:             Data := DECODE({D_1, ..., D_N})
15:             RETURN(⟨LT_candidate, Data⟩)
16:         end if
17:     end if
18:     /* Partial or data-item validation failed, loop again */
19:     ResponseSet := DO_READ(READ_PREVIOUS_REQUEST, LT_candidate)
20: end loop

DO_READ(READ_COMMAND, LT) :
 1: for all S_i ∈ {S_1, ..., S_N} do
 2:     SEND(S_i, READ_COMMAND, LT)
 3: end for
 4: ResponseSet := ∅
 5: repeat
 6:     Resp := RECEIVE(S, READ_RESPONSE)
 7:     if (VALIDATE(Resp.D, Resp.CC, Resp.LT, S) = TRUE) then
 8:         ResponseSet := ResponseSet ∪ Resp
 9:     end if
10: until (|ResponseSet| = N − t)
11: RETURN(ResponseSet)

VALIDATE(D, CC, LT, S) :
 1: if ((HASH(CC) ≠ LT.Verifier) OR (HASH(D) ≠ CC[S])) then
 2:     RETURN(FALSE)
 3: end if
 4: RETURN(TRUE)

Figure 3: Read operation pseudo-code.

The  integrity  of  each  response  is  individually  validated  through  the  VALIDATE  function,  called 
on line 7 of DO_READ. This function checks the cross checksum against the Verifier found in the
logical  timestamp  and  the  data-fragment  against  the  appropriate  hash  in  the  cross  checksum. 

Since,  in  an  asynchronous  system,  slow  storage-nodes  cannot  be  differentiated  from  crashed 
storage-nodes, only $N - t$ read responses can be collected (line 10 of DO_READ). Since correct
storage-nodes  perform  the  same  validation  before  executing  write  requests,  the  only  responses 
that  can  fail  the  client’s  validation  are  those  from  Byzantine  storage-nodes.  For  every  discarded 
Byzantine  storage-node  response,  an  additional  response  can  be  awaited. 

After sufficient responses have been received, a candidate for classification is chosen. The
function CHOOSE_CANDIDATE, on line 3 of READ, determines the candidate timestamp, denoted
$LT_{candidate}$, which is the greatest timestamp found in the response set. All data-fragments that share
$LT_{candidate}$ are identified and returned as the candidate set. At this point, the candidate set contains
a set of validated data-fragments that share a common cross checksum and logical timestamp.
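Candidate selection can be sketched as follows. The `ReadResponse` record and the tuple encoding of timestamps are assumptions for illustration; the responses are taken to have already passed validation.

```python
from typing import NamedTuple


class ReadResponse(NamedTuple):
    node: int              # index of the responding storage-node
    timestamp: tuple       # (time, client_id, verifier), compared lexicographically
    fragment: bytes
    cross_checksum: bytes


def choose_candidate(responses):
    """Return the responses sharing the greatest timestamp, and that timestamp."""
    lt_candidate = max(r.timestamp for r in responses)
    candidate_set = [r for r in responses if r.timestamp == lt_candidate]
    return candidate_set, lt_candidate
```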
Once a candidate has been chosen, it is classified as either complete, repairable, or incomplete.
Given that only $N - t$ read responses can be collected, a completed write at $Q_W$ storage-nodes, $b$
of which may be Byzantine (and capable of “hiding” the write), may have as few as $Q_W - t - b$
data-fragments visible to a later read. Accounting for this lack of information, the classification
rules are:

• $|CandidateSet| \ge Q_W$: the candidate is classified as complete;

• $Q_W - t - b \le |CandidateSet| < Q_W$: the candidate is classified as repairable;

• $|CandidateSet| < Q_W - t - b$: the candidate is classified as incomplete.
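The three rules translate directly into a small function. The function name and the string labels are illustrative choices, not part of the protocol.

```python
def classify(candidate_count, q_w, t, b):
    """Classify a candidate from the size of its candidate set."""
    if candidate_count >= q_w:
        return "complete"
    if candidate_count >= q_w - t - b:
        return "repairable"
    return "incomplete"
```

For instance, assuming t = 1, b = 1 and a write threshold of 4, the repairable threshold is 4 - 1 - 1 = 2, so a single response (all that one Byzantine storage-node could fabricate on its own) is classified as incomplete.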

If the candidate is classified as incomplete, a READ_PREVIOUS_REQUEST message is sent to
each  storage-node  with  the  candidate  timestamp.  Candidate  classification  begins  again  with  the 
new  response  set. 

If the candidate is classified as either complete or repairable, the candidate set contains sufficient
data-fragments written by the client to decode the original data-item. To validate the observed
write’s integrity, the candidate set is used to generate a new set of data-fragments (line 6 of READ).
A validated cross checksum, $CC_{valid}$, is computed from the newly generated data-fragments. The
validated  cross  checksum  is  compared  to  the  cross  checksum  of  the  candidate  set  (line  8  of  READ). 
If  the  check  fails,  the  candidate  was  written  by  a  Byzantine  client;  the  candidate  is  reclassified  as 
incomplete  and  the  read  operation  continues.  If  the  check  succeeds,  the  candidate  was  written  by 
a  correct  client  and  the  read  enters  its  final  phase.  Note  that  this  check  either  succeeds  or  fails  for 
all  correct  clients  regardless  of  which  storage-nodes  are  represented  within  the  candidate  set. 

If  necessary,  repair  is  performed:  write  requests  are  issued  with  the  generated  data-fragments, 
the  validated  cross  checksum,  and  the  logical  timestamp  (line  10  of  READ).  Storage-nodes  not 
currently  hosting  the  write  execute  the  write  at  the  given  logical  time;  those  already  hosting  the 
write are safe to ignore it. Finally, the function DECODE, on line 14 of READ, decodes m data-fragments,
returning the data-item.

It  should  be  noted  that,  even  after  a  write  completes,  it  may  be  classified  as  repairable  by  a 
subsequent  read,  but  it  will  never  be  classified  as  incomplete.  For  example,  this  will  occur  if  the 
read set (of $N - t$ storage-nodes) does not fully encompass the write set (of $Q_W$ storage-nodes).

4.4  Protocol  constraints 

To ensure that linearizability and liveness are achieved, $Q_W$ and $N$ must be constrained with regard
to $b$, $t$, and each other. As well, the parameter $m$, used in DECODE, must be constrained. We sketch
the proof that the protocol under these constraints is safe and live in Appendix I.

Write termination: To ensure write operations are able to complete in an asynchronous
environment,

$$Q_W \le N - t. \tag{1}$$

Since slow storage-nodes cannot be differentiated from crashed storage-nodes, only $N - t$
responses can be awaited.

Overlap: To ensure linearizability, the function CHOOSE_CANDIDATE must select the latest
complete write as the candidate. For this to occur, the response set of a read operation (or a get logical
time operation) must intersect with at least one correct (non-Byzantine) storage-node that executed
a write request of a complete write. Let $Q_R$ be the size of the read response set. The write set and
the read response set must then share more than $b$ storage-nodes, so that at least one of them is correct:

$$N + b < Q_W + Q_R,$$

$$N + b - Q_W < Q_R. \tag{2}$$

This is the source of the constraint on line 8 of WRITE.

Real repairable candidates: If Byzantine storage-nodes, through collusion, can fabricate a
candidate that a client deems repairable, then the storage-nodes can “trick” a client into repairing
a value that was never entered into the system by an authorized client. To ensure that Byzantine
storage-nodes can never fabricate a repairable candidate, a candidate set of size $b$ must be
classifiable as incomplete. Since a candidate is repairable if $Q_W - t - b$ or more responses are observed
(recall the classification rules), then

$$b < Q_W - t - b,$$

$$t + 2b < Q_W. \tag{3}$$

Constraint summary: Constraints (1) and (3) yield overall constraints on $Q_W$ and $N$:

$$t + 2b < Q_W \le N - t,$$

$$2t + 2b + 1 \le N.$$

Repairable candidates: For repair to be possible, a repairable candidate must be decodable.
The fewest responses a repairable candidate can have is $Q_W - t - b$; therefore,

$$1 \le m \le Q_W - t - b.$$

So, $m$ can always be at least $b + 1$ (note: larger values of $m$ yield more space-efficient erasure
coding). If a repairable candidate is decodable, then a complete candidate is clearly also decodable.
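
The constraints above can be checked mechanically. The following is a minimal sketch, not part of the PASIS prototype: the names `Config`, `violated_constraints`, and `minimal_config` are hypothetical, and setting the read response set size to $N - t$ is an assumption based on the statement that only $N - t$ responses can be awaited.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    N: int    # total number of storage-nodes
    Q_W: int  # write threshold
    Q_R: int  # size of the read response set
    m: int    # fragments needed to decode a data-item
    t: int    # total storage-node failures tolerated
    b: int    # Byzantine failures tolerated (b <= t)


def violated_constraints(c):
    """Return the names of the constraints that configuration c violates."""
    problems = []
    if not c.b <= c.t:
        problems.append("b <= t")
    if not c.Q_W <= c.N - c.t:
        problems.append("(1) write termination")
    if not c.N + c.b - c.Q_W < c.Q_R:
        problems.append("(2) overlap")
    if not c.t + 2 * c.b < c.Q_W:
        problems.append("(3) real repairable candidates")
    if not 1 <= c.m <= c.Q_W - c.t - c.b:
        problems.append("decodable repairable candidates")
    return problems


def minimal_config(t, b):
    """Smallest Q_W and N allowed by (1) and (3), with the largest legal m."""
    Q_W = t + 2 * b + 1          # smallest integer with t + 2b < Q_W
    N = Q_W + t                  # smallest N with Q_W <= N - t
    return Config(N=N, Q_W=Q_W, Q_R=N - t, m=Q_W - t - b, t=t, b=b)


if __name__ == "__main__":
    for b in range(1, 5):
        c = minimal_config(t=b, b=b)
        print(c, violated_constraints(c))
```

For $b = t$, `minimal_config` yields $N = 4b + 1$, $Q_W = 3b + 1$, and $m = b + 1$.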


5  Evaluation 

This section evaluates the consistency protocol's performance in the context of a prototype storage
system called PASIS [41]. We also compare the PASIS implementation of our protocol with the
BFT library implementation [7], an efficient implementation of the BFT protocol for replicated
state machines [5].

5.1  Prototype  implementation 

PASIS consists of clients and storage-nodes. Storage-nodes store data-fragments and their
versions. Clients execute the protocol to read and write data-items.

5.1.1  Storage-node  implementation

Our storage-nodes use the Comprehensive Versioning File System (CVFS) [38] to retain
data-fragments and their versions. CVFS uses a log-structured data organization to reduce the cost of
data versioning. Experience indicates that retaining every version and performing local garbage
collection comes with minimal performance cost (a few percent) and that it is feasible to retain
complete version histories for several days [38, 39].

We extended CVFS to provide an interface for retrieving the logical timestamp of a
data-fragment. Each write request contains a data-fragment, a logical timestamp, and a cross checksum.
In a normal read response, storage-nodes return all three. To improve performance, read responses
contain a limited version history containing logical timestamps of previously executed write
requests. The version history allows clients to identify and classify additional candidates without
issuing extra read requests. The storage-node can also return read responses that contain no data
other than version histories (this makes candidate classification more network efficient).

5.1.2  Client  implementation 

The client module provides a block-level interface to higher level software, and uses a simple
RPC interface to communicate with storage-nodes. The RPC mechanism uses TCP/IP. The client
module is responsible for the execution of the consistency protocol and for encoding and decoding
data-items.

Initially, read requests are issued to the first $Q_W$ storage-nodes. Only $m$ of these request the
data-fragment, while all request version histories; this makes the read operation more network
efficient. If the read responses do not yield a candidate that is classified as complete, read requests
are issued to the remaining storage-nodes (and a total of up to $N - t$ responses are awaited). If
the initial candidate is classified as incomplete, subsequent rounds of read requests fetch only
version histories until a candidate is classified as either repairable or complete. If necessary, after
classification, extra data-fragments are fetched according to the candidate timestamp. Once the
data-item is successfully validated and decoded, it is returned to the client.

5.1.3  Mechanism  implementation 

We measure the space-efficiency of an erasure code in terms of blowup: the amount of data stored
over the size of the data-item. We use an information dispersal algorithm [30], which has a blowup
of $N/m$. Our information dispersal implementation stripes the data-item across the first $m$
data-fragments (i.e., each data-fragment is $1/m$ of the original data-item's size). These stripe-fragments
are used to generate the code-fragments via polynomial interpolation within a Galois Field, which
treats the stripe-fragments and code-fragments as points on some polynomial of degree $m - 1$. Our
implementation of polynomial interpolation was originally based on publicly available code for
information dispersal [10]. We modified the source to make use of stripe-fragments and added an
implementation of Galois Fields of size $2^8$ that use a lookup table for multiplication.
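
To make the striping and interpolation concrete, here is a small, unoptimized sketch of an m-of-n dispersal code. It is not the PASIS implementation. The 256-element field, the reduction polynomial 0x11D, the evaluation points 0 to n - 1, and the function names are all assumptions chosen for illustration.

```python
# Log/antilog lookup tables for a 256-element Galois field.
# The polynomial 0x11D and generator 2 are illustrative choices.
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = EXP[_i + 255] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D


def gf_mul(a, b):
    """Multiply two field elements using the lookup tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_div(a, b):
    """Divide a by a non-zero field element b."""
    if a == 0:
        return 0
    return EXP[LOG[a] - LOG[b] + 255]


def interpolate(points, x):
    """Evaluate at x the polynomial through the given (x_i, y_i) points.

    Lagrange interpolation; addition and subtraction in this field are XOR.
    """
    result = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = gf_mul(num, x ^ xj)
                den = gf_mul(den, xi ^ xj)
        result ^= gf_mul(yi, gf_div(num, den))
    return result


def encode(data, m, n):
    """Stripe data over m fragments, then derive n - m code-fragments."""
    stripe_len = -(-len(data) // m)  # ceiling division
    padded = data.ljust(stripe_len * m, b"\0")
    frags = [padded[i * stripe_len:(i + 1) * stripe_len] for i in range(m)]
    for k in range(m, n):
        frags.append(bytes(
            interpolate([(i, frags[i][j]) for i in range(m)], k)
            for j in range(stripe_len)))
    return frags


def decode(available, m, length):
    """Rebuild the data-item from any m fragments, given as {index: bytes}."""
    chosen = sorted(available.items())[:m]
    if len(chosen) < m:
        raise ValueError("need at least m fragments")
    stripe_len = len(chosen[0][1])
    stripes = []
    for i in range(m):
        if i in available:
            stripes.append(available[i])  # stripe-fragment present: no math
        else:
            stripes.append(bytes(
                interpolate([(x, frag[j]) for x, frag in chosen], i)
                for j in range(stripe_len)))
    return b"".join(stripes)[:length]
```

Because the first m fragments are plain stripes of the data-item, a reader that holds all of them can decode without any field arithmetic; interpolation is needed only for missing stripes and for regenerating code-fragments.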

Our implementation of cross checksums closely follows Gong [13]. We use a publicly
available implementation of MD5 for all hashes [1]. Each MD5 hash is 16 bytes long; thus, each cross
checksum is $N \times 16$ bytes long.
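
As an illustration of the size calculation, a cross checksum can be sketched as the concatenation of one MD5 hash per data-fragment. This uses Python's standard `hashlib` rather than the MD5 implementation cited above, and the function name `cross_checksum` is hypothetical.

```python
import hashlib


def cross_checksum(fragments):
    """Concatenate the 16-byte MD5 hash of each of the N data-fragments."""
    return b"".join(hashlib.md5(fragment).digest() for fragment in fragments)


if __name__ == "__main__":
    cc = cross_checksum([b"frag-0", b"frag-1", b"frag-2", b"frag-3", b"frag-4"])
    print(len(cc))  # N * 16 bytes for N = 5 fragments
```
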
5.2  Experimental  setup

We use a cluster of 20 machines to perform experiments. Each machine is a dual 1 GHz Pentium
III machine with 384 MB of memory. Each storage-node uses a 9 GB Quantum Atlas 10K as the
storage device. The machines are connected through a 100 Mb switch. All machines run the Linux
2.4.20 SMP kernel.

In all experiments, clients keep a fixed number of read and write operations outstanding. Once
an operation completes, a new operation is issued (there is no think time). Unless otherwise
specified, requests are for random 16 KB blocks. For all experiments, the working set fits into memory
and all caches are warmed up beforehand.

Most experiments focus on configurations where $b = t$ and $t \in \{1, 2, 3, 4\}$. Thus, for our
protocol, $N = 4b + 1$, $Q_W = 3b + 1$, and $m = b + 1$. For BFT, $N = 3b + 1$ (i.e., $N \in \{4, 7, 10, 13\}$).

5.2.1  PASIS  configuration 

Each  storage-node  is  configured  with  128  MB  of  data  cache,  and  no  caching  is  done  on  the  clients. 
All  experiments  show  results  using  write-back  caching  at  the  storage  nodes,  mimicking  availability 
of  16  MB  of  non-volatile  RAM.  This  allows  us  to  focus  experiments  on  the  overheads  introduced 
by  the  protocol  and  not  those  introduced  by  the  disk  subsystem.  No  authentication  is  currently 
performed  in  the  PASIS  prototype  (authentication  is  expected  to  have  little  performance  impact). 

5.2.2  BFT  configuration 

Operations in BFT [5] require agreement among the replicas (storage-nodes in PASIS).
Agreement is performed in four steps: (i) the client broadcasts requests to all replicas; (ii) the primary
broadcasts pre-prepare messages to all replicas; (iii) all replicas broadcast prepare messages to
all replicas; and, (iv) all replicas send replies back to the client and then broadcast commit
messages to all other replicas. Commit messages are piggy-backed on the next pre-prepare or prepare
message to reduce the number of messages on the network. Authenticators, chains of MACs, are
used to ensure that broadcast messages from clients and replicas cannot be modified by a
Byzantine replica. All clients and replicas have public and private keys that enable them to exchange
symmetric cryptography keys used to create MACs. Logs of commit messages are checkpointed
(garbage collected) periodically. View changes, in which a new primary is selected, are performed
periodically.

A fast path for read operations is implemented in BFT. The client broadcasts its request to
all replicas. Each replica replies once all messages previous to the request are committed. Only
one replica sends the full reply (i.e., the data and digest), and the remainder just send digests that
can verify the correctness of the data returned. If the replies from replicas do not agree, the client
re-issues the read operation; for the replies to agree, the read-only request must arrive at $2b + 1$ of
the replicas in the same order (with regard to other write operations). The re-issued read operation
performs agreement before a single replica replies.

The BFT configuration does not store data to stable storage; instead, it stores all data in
memory and accesses it via memory offsets. For the experiments, checkpointing and view changes are
suppressed. BFT uses UDP connections rather than TCP. The BFT implementation defaults to
using IP multicast; however, IP multicast was disabled for all experiments. The BFT implementation
authenticates broadcast messages via authenticators, and point-to-point messages with MACs.
                        b=1     b=2     b=3     b=4
Erasure coding (ms)     1.25    1.50    1.73    1.99
Cross checksum (ms)     0.36    0.44    0.48    0.51
Verifier (μs)           1.64    2.31    3.61    4.28
Validate (μs)           81.5    58.0    48.3    40.1

Table 1: Computation costs in PASIS (1 GHz CPU)


5.3  Performance  and  scalability 

5.3.1  Mechanism  costs 

Client and storage-node computation costs in PASIS are listed in Table 1. For every read and
write operation, clients perform erasure coding (i.e., they compute $N - m$ data-fragments given $m$
data-fragments), generate a cross checksum, and generate a verifier. Recall that writes generate the
first $m$ data-fragments by striping the data-item into $m$ fragments. Similarly, reads must generate
$N - m$ fragments, from the $m$ they have, in order to verify the cross checksum. The costs of erasure
encoding, cross checksumming, and creating the verifier grow with $m$ and $N$.

Storage-nodes validate each write request they receive. This validation requires a comparison
of the data-fragment's hash to the hash within the cross checksum, and a comparison of the cross
checksum's hash to the verifier within the timestamp. The cost of storage-node validation shrinks
with the data-fragment size; as such, it decreases as $t$ increases.
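
The two comparisons can be sketched as follows. This is an illustrative reading of the description above, not the PASIS code: the function name `validate`, the use of MD5 for the verifier, and the assumption that a storage-node's hash sits at offset `node_index * 16` in the cross checksum are all hypothetical.

```python
import hashlib

HASH_LEN = 16  # an MD5 digest is 16 bytes


def validate(fragment, cross_checksum, verifier, node_index):
    """Return True if a write request passes both storage-node checks."""
    # Check 1: the fragment's hash must match this node's slot in the
    # cross checksum (slot layout is an assumption of this sketch).
    start = node_index * HASH_LEN
    expected = cross_checksum[start:start + HASH_LEN]
    if hashlib.md5(fragment).digest() != expected:
        return False
    # Check 2: the hash of the whole cross checksum must match the
    # verifier carried in the logical timestamp.
    return hashlib.md5(cross_checksum).digest() == verifier
```

Only the first check hashes the data-fragment itself, which is consistent with the validation cost falling as fragments get smaller.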

5.3.2  Response  time 

Figure  4  shows  the  mean  response  time  of  a  single  request  from  a  single  client  as  a  function  of 
tolerated  number  of  storage-node  failures.  The  focus  in  this  plot  is  the  slopes  of  the  response 
time  lines:  the  flatter  the  line  the  more  scalable  the  protocol  is  with  regard  to  the  number  of  faults 
tolerated.  A  key  contributor  to  response  time  is  network  cost,  which  is  dictated  by  both  the  number 
of  nodes  and  the  space-efficiency  of  the  encoding.  PASIS  has  better  response  times  than  BFT 
for  write  operations  due  to  the  space-efficiency  of  erasure  codes  and  the  nominal  amount  of  work 
storage-nodes  perform  to  execute  write  requests. 

PASIS  has  longer  response  times  than  BFT  for  read  operations.  This  can  be  attributed  to  three 
main  factors:  First,  for  our  protocol,  the  client  computation  cost  grows  as  the  number  of  failures 
tolerated  increases  because  the  cost  of  generating  data-fragments  grows  as  m  increases.  Second, 
the  PASIS  storage-nodes  store  data  in  a  real  file  system;  although  the  data  is  serviced  from  cache, 
the  expenses  to  fetch  such  data  are  larger  than  a  single  memory  access.  The  BFT  setup  used  does 
not incur these costs, but would in an apples-to-apples comparison. Third, BFT's implementation
of “fast path” read operations scales extremely well in terms of failures tolerated. Recall that one
replica returns the data plus a digest, the remainder return just a digest, and no agreement is
performed; the operation completes in a single round-trip with little computation cost.

In addition to the $b = t$ case, Figure 4 shows one instance of PASIS assuming a hybrid fault
model with $b = 1$. For space-efficiency, we set $m = t + 1$, which is $t - b$ greater than the minimum
allowed value of $m$. Consequently, $Q_W$ and $N$ must also be set above their minimum values. The
$b = 1$ line thus corresponds to a configuration with $b = 1$, $m = t + 1$, $N = 3t + 2$, and $Q_W = 2t + 2$.

Figure 4: Mean response time vs. total failures tolerated. Mean response times of read and
write operations of random 16 KB blocks in PASIS and BFT. For PASIS, lines corresponding to
both b=1 and b=t are shown.

At $t = 1$, this configuration is identical to the Byzantine-only configuration. As $t$ increases, this
configuration is more space-efficient than the Byzantine-only configuration, since it requires $t - 1$
fewer storage-nodes ($3t + 2$ rather than $4t + 1$). As such, both read and write operations scale better.

Some  read  operations  in  PASIS  can  require  repair.  A  repair  operation  must  perform  a  “write” 
operation  to  repair  the  value  before  it  is  returned  by  the  read.  Interestingly,  the  response  time 
of  a  read  that  performs  repair  is  less  than  the  sum  of  the  response  times  of  a  normal  read  and  a 
write  operation.  This  is  because  the  “write”  operation  during  repair  does  not  need  to  read  logical 
timestamps  before  issuing  write  requests.  Additionally,  data-fragments  need  only  be  written  to 
storage-nodes  that  do  not  already  host  the  write  operation. 

5.3.3  Scalability 

Figure 5 breaks mean response times of read and write operations into the costs at the client, on
the network, and at the storage-node for $b = 1$ and $b = 4$. Since measurements are taken at the
user-level, kernel-level timings for host network protocol processing (including network system
calls) are attributed to the “network” cost of the breakdowns. To understand the scalability of these
protocols, it is important to understand these breakdowns.

Although write operations for both protocols have similar response times for $b = 1$, the
response times of BFT write operations scale poorly. The large network cost for BFT writes is due to
the space-inefficiency of replication. For $b = 4$, BFT has a blowup of $13\times$ on the network, whereas
our protocol has a blowup of $17/5 = 3.4\times$ on the network. Note that BFT can use IP multicast to
mitigate this effect on write operation response time, though the aggregate network bandwidth
utilization would still grow.
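
The blowup figures follow directly from the experimental configurations ($N = 4b + 1$ and $m = b + 1$ for our protocol, $N = 3b + 1$ full replicas for BFT). A short sketch of the arithmetic, with hypothetical function names:

```python
from fractions import Fraction


def pasis_write_blowup(b):
    """N/m for the b = t configuration: N = 4b + 1 fragments, m = b + 1."""
    return Fraction(4 * b + 1, b + 1)


def bft_write_blowup(b):
    """Full replication sends one copy to each of the 3b + 1 replicas."""
    return 3 * b + 1


if __name__ == "__main__":
    for b in range(1, 5):
        print(b, float(pasis_write_blowup(b)), bft_write_blowup(b))
```

The erasure-coded blowup stays below 4 for every $b$, while the replication blowup grows linearly with $b$.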

Regardless of whether or not IP multicast is employed, our protocol in PASIS requires much
less computation on storage-nodes than BFT requires on replicas. Server cost is broken down to
distinguish protocol costs from the cost of server storage (i.e., stable data storage). Figure 5 shows
that as $b$ increases, protocol-related server computation on write operations grows significantly for
BFT and barely changes for PASIS. In PASIS, the server protocol cost decreases from 90 μs for
$b = 1$ to 57 μs for $b = 4$, whereas in BFT it increases from 0.80 ms to 2.1 ms. The server cost in
PASIS decreases because $m$ increases as $b$ increases, reducing the size of the data-fragment that
is validated. Since the BFT library keeps all data in memory and accesses blocks via memory
offsets, it incurs almost no server storage costs. We expect that a BFT implementation with stable
data storage would incur server storage costs similar to PASIS (e.g., around 0.7 ms for a write
operation, as is shown for $b = 1$ in Figure 5).

Figure 5: Protocol cost breakdown. The bars illustrate the cost breakdown of read and write
operations for PASIS and BFT for b = 1 and b = 4. Each bar corresponds to a single point on the
mean response time graph in Figure 4.

By  offloading  work  from  the  storage-nodes  to  the  client,  PASIS  greatly  increases  its  scalability 
in  terms  of  supported  client  load.  Unfortunately,  a  direct  throughput  comparison  to  our  BFT  setup 
was  impossible  due  to  its  poor  stability  under  heavy  load.  We  have  observed  that  the  throughput  in 
PASIS  scales  according  to  network  and  server  CPU  utilization. 

Under heavy load, the response time of read operations grows in PASIS due to resource
contention. The same is true of BFT. Moreover, we believe that under load, the “fast path” for read
operations in BFT becomes less effective: replicas delay read replies until all previous requests
have committed, and the likelihood that read requests are similarly ordered at distinct replicas
diminishes.

5.3.4  Concurrency 

In PASIS, both read-write concurrency and client crashes during write operations can lead to client
read operations observing repairable writes. To measure the effect of concurrency on the system,
we measure multi-client throughput when accessing overlapping block sets. The experiment makes
use of four clients, each with four operations outstanding. Each client accesses a range of eight
data blocks, with no outstanding requests from the same client going to the same block.

At the highest concurrency level (all eight blocks in contention by all clients), we observed
neither significant drops in bandwidth nor significant increases in mean response time. Even at this
high concurrency level, the initial candidate was classified as complete 88.8% of the time, and
repair was necessary only 3.3% of the time. Since repair occurs so seldom, the effect on response
time and throughput is minimal.


6  Discussion 

Garbage  Collection:  Pruning  old  versions,  or  garbage  collection,  is  necessary  to  prevent 
capacity  exhaustion  of  the  backend  storage-nodes.  A  storage-node  in  isolation,  by  the  nature  of 
the  protocol,  cannot  determine  which  local  data-fragment  versions  are  safe  to  garbage-collect.  An 
individual  storage-node  can  garbage  collect  a  data-fragment  version  if  there  exists  a  later  complete 
write.  Storage-nodes  are  able  to  classify  writes  by  executing  the  consistency  protocol  in  the  same 
manner  as  the  client. 
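
The pruning rule can be sketched as a simple filter. This is a simplification for illustration only: logical timestamps are modeled as plain integers, the function name is hypothetical, and the timestamp of the latest complete write is assumed to have been determined already by running the classification step of the protocol.

```python
def collectible_versions(local_timestamps, latest_complete_ts):
    """Return the local version timestamps that precede the latest complete write."""
    return sorted(ts for ts in local_timestamps if ts < latest_complete_ts)


if __name__ == "__main__":
    # Versions 1 and 5 precede the complete write at 7 and can be pruned;
    # version 9 is later and must be kept.
    print(collectible_versions([5, 1, 9, 7], latest_complete_ts=7))
```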

Byzantine  clients:  In  a  storage  system,  Byzantine  clients  can  write  arbitrary  values.  The 
use  of  fine-grained  versioning  (i.e.,  self-securing  storage  [39]),  facilitates  detection,  recovery,  and 
diagnosis  from  storage  intrusions.  Arbitrarily  modified  data  can  be  rolled  back  to  its  pre-corruption 
state. 

Byzantine clients can also attempt to exhaust the resources available to the PASIS protocol.
Issuing an inordinate number of write operations could exhaust storage space. However,
continuous garbage collection frees storage space prior to the latest complete write. If a Byzantine client
were to intentionally issue incomplete write operations, then garbage collection may not be able
to free up space. In addition, incomplete writes require read operations to roll back behind them,
thus consuming client computation and network resources. In practice, relying on storage-based
intrusion detection [27] is probably sufficient to detect clients that exhibit such behavior.


7  Summary 

Storage-node  versioning  enables  an  efficient  protocol  that  provides  strong  consistency  and  liveness 
in  the  face  of  Byzantine  failures  and  concurrency.  The  protocol  achieves  scalability  by  offloading 
work  to  the  client  and  using  space-efficient  erasure  coding.  Measurements  of  the  PASIS  prototype 
demonstrate these benefits.

8  Appendix  I:  Proofs

8.1  Proof  of  safety 

This section sketches a proof that our protocol implements linearizability [16] as adapted
appropriately for a fault model admitting operations by Byzantine clients. Intuitively, linearizability
requires that each read operation return a value consistent with some execution in which each read
and write is performed at a distinct point in time between when the client invokes the operation
and when the operation returns. The adaptations necessary to reasonably interpret linearizability in
our context arise from the fact that Byzantine clients need not follow the read and write protocols.
The first adaptation is necessary because return values of reads by Byzantine clients obviously
need not comply with any correctness criteria. As such, we disregard read operations by Byzantine
clients in reasoning about linearizability, and define the duration of reads only for those executed
by benign clients.

Definition 1 A read operation executed by a benign client begins when the client invokes READ
locally, and completes when this invocation returns $\langle timestamp, value \rangle$.

The second needed adaptation of linearizability arises from the fact that it is not well defined
when a write operation by a Byzantine client begins. Therefore, we settle for merely a definition
of when writes by Byzantine clients complete.

Definition 2 Storage-node $S$ accepts a write request with data-fragment $D$, cross checksum $CC$,
and timestamp $ts$ upon successful return of the function VALIDATE($D$, $CC$, $ts$, $S$) at the
storage-node.

Definition 3 A write operation with timestamp $ts$ completes once $Q_W - b$ benign storage-nodes
have accepted write requests with timestamp $ts$.

In fact, Definition 3 applies to write operations by benign clients as well as “write operations”
by Byzantine clients. In this section, we use the label $w_{ts}$ as a shorthand for the write operation
with timestamp $ts$. In contrast to Definition 3, Definition 4 applies only to write operations by
benign clients.

Definition 4 $w_{ts}$ begins when a benign client invokes the WRITE operation locally that issues a
write request bearing timestamp $ts$.

Lemma 1 Let $c_1$ and $c_2$ be benign clients. If $c_1$ performs a read operation that returns $\langle ts_1, v_1 \rangle$,
$c_2$ performs a read operation that returns $\langle ts_2, v_2 \rangle$, and $ts_1 = ts_2$, then $v_1 = v_2$.

Proof sketch: Since $ts_1 = ts_2$, each read operation considers the same verifier. Since each read
operation considers the same verifier, each read operation considers the same cross checksum. A
read operation does not return a value unless the cross checksum is valid and there are more than
$b$ read responses with the timestamp (since only candidates classified as repairable or complete
are considered). Thus, only a set of data-fragments resulting from the erasure-coding of the same
data-item that are issued as write requests with the same timestamp can validate a cross checksum.
As such, $v_1$ and $v_2$ must be the same. □

Let $v_{ts}$ denote the value written by $w_{ts}$ which, by Lemma 1, is well-defined. We use $r_{ts}$ to
denote a read operation by a benign client that returns $\langle ts, v_{ts} \rangle$.

Definition 5 Let $o_1$ denote an operation that completes (a read by a benign client, or a write),
and let $o_2$ denote an operation that begins (a read or write by a benign client). $o_1$ precedes $o_2$ if $o_1$
completes before $o_2$ begins. The precedence relation is written as $o_1 \rightarrow o_2$.

Operation $o_2$ is said to follow, or to be subsequent to, operation $o_1$. The notation $o_1 \not\rightarrow o_2$ is
used to mean operation $o_1$ does not precede operation $o_2$.

Lemma 2 If $w_{ts'}$ is a write operation by a benign client and if $w_{ts} \rightarrow w_{ts'}$, then $ts < ts'$.

Proof sketch: Constraint (2), the overlap constraint, ensures that the first phase of WRITE
achieves a higher timestamp than any preceding write operation. A complete write operation
executes at at least $Q_W - b$ benign storage-nodes. Considering $Q_R$ TIME_RESPONSE messages ensures
at least one such response is from a correct storage-node that executed the preceding write
operation (line 8 of WRITE). A Byzantine storage-node can return a logical timestamp greater than that
of the preceding write operation; however, this still advances logical time and Lemma 2 holds. □

Observation  1  Timestamp  order  is  a  total  order  on  write  operations.  The  timestamps  of  write 
operations  by  benign  clients  respect  the  precedence  order  among  writes. 

Lemma 3 If some read operation by a benign client returns $\langle ts, v_{ts} \rangle$, and if $w_{ts} \rightarrow r_{ts'}$, then $ts \le ts'$.

Proof sketch: By Definition 3, since $w_{ts}$ completes, there are $Q_W - b$ benign storage-nodes that
accept write requests with timestamp $ts$. Storage-node crashes and the asynchronous environment
can “hide” up to $t$ of the $Q_W - b$ accepted write requests from $r_{ts'}$. As such, at least $Q_W - t - b$
responses with timestamp $ts$ are observable by $r_{ts'}$; a read operation that observes a candidate with
at least $Q_W - t - b$ responses performs repair (line 10 of READ). Since some read operation returns
$\langle ts, v_{ts} \rangle$, $v_{ts}$ can be returned from a read operation performed by a benign client. Thus, $r_{ts'}$ either
repairs $v_{ts}$, observes $v_{ts}$ as complete, or observes some value with a timestamp higher than $ts$. □

Observation 2 It follows from Lemma 3 that if $r_{ts} \rightarrow r_{ts'}$, then $ts \le ts'$. As such, there is a partial
order $\prec$ on read operations by benign clients defined by the timestamps associated with the values
returned (i.e., of the write operations read). More formally, $r_{ts} \prec r_{ts'} \iff ts < ts'$.

Ordering  reads  according  to  the  timestamps  of  the  write  operations  whose  values  they  return 
yields  a  partial  order  on  read  operations.  Lemma  3  ensures  that  this  partial  order  is  consistent 
with  precedence  among  reads.  Therefore,  any  way  of  extending  this  partial  order  to  a  total  order 
yields  an  ordering  of  reads  that  is  consistent  with  precedence  among  reads.  Lemmas  2  and  3 
guarantee  that  this  totally  ordered  set  of  operations  is  consistent  with  precedence.  This  implies  the 
natural  extension  of  linearizability  to  our  fault  model  (i.e.,  ignoring  reads  and  durations  of  writes 
by  Byzantine  clients);  in  particular,  it  implies  linearizability  as  originally  defined  [16]  if  all  clients 
are benign.

8.2  Proof  of  liveness


Our  protocol  provides  a  strong  liveness  property,  namely  wait-freedom  [14,  18].  Informally,  each 
operation  by  a  correct  client  completes  with  certainty,  even  if  all  other  clients  fail,  provided  that  at 
most  b  servers  suffer  Byzantine  failures  and  t  servers  fail  in  total. 

Lemma  4  A  write  operation  by  a  correct  client  completes. 

Proof sketch: A write operation by a correct client waits for $Q_W$ storage-nodes to execute write
requests before returning (line 7 of DO_WRITE). Since $N - t$ storage-nodes are always available and
$Q_W \le N - t$, write operations always terminate. □

Lemma  5  A  read  operation  by  a  correct  client  completes. 

Proof sketch: Given $N - t$ READ_RESPONSE messages, a read operation classifies a candidate as
complete, repairable, or incomplete. The read completes if a candidate is classified as complete. As
well, the read completes if a candidate is repairable: repair is initiated for repairable candidates
(repair performs a write operation, which by Lemma 4 completes), which lets the read operation
complete. In the case of an incomplete candidate, the read operation traverses the version history
backwards until a complete or repairable candidate is discovered. Traversal of the version history
terminates if $\bot$ at logical time 0 is encountered at $Q_W$ storage-nodes. □